Partition-based incremental processing method for string similarity join
YAN Cairong, ZHU Bin, WANG Jian, HUANG Yongfeng
Journal of Computer Applications    2016, 36 (1): 27-32.   DOI: 10.11772/j.issn.1001-9081.2016.01.0027
String similarity join is an essential operation in data quality management and a key step in extracting value from data. Since existing methods cannot meet the demand for incremental processing in the era of big data, an incremental string similarity join method oriented to streaming data, called Inc-Join, was proposed, and its string index technique was optimized. Firstly, based on the Pass-Join string join algorithm, strings were divided into disjoint substrings using a partition technique; secondly, an inverted index of the strings was created and maintained as a state; finally, when new data arrived, the similarity calculation was performed against the state, and the state was updated after each join operation. The experimental results show that Inc-Join reduces the number of duplicate matches between short or long strings to √n (where n is the number of matches under the batch processing model) without affecting join accuracy. On three different datasets, the elapsed time of string similarity join under the batch processing model was 1 to 4.7 times that of Inc-Join, and it tended to increase sharply. The minimum elapsed time of the optimized Inc-Join was only 3/4 of that of the original Inc-Join, and this proportion keeps shrinking as the number of strings grows. Since the optimized Inc-Join no longer needs to save the state, it further reduces the time and space cost of Inc-Join.
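The partition-then-index pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses the Pass-Join pigeonhole idea (if two strings are within edit distance τ, at least one of the τ+1 disjoint segments of one appears unchanged as a substring of the other), keeps the inverted index as the incremental state, and verifies candidates with exact edit distance. Real Pass-Join additionally restricts probe positions and segment lengths; the class and method names here are hypothetical.

```python
from collections import defaultdict

def edit_distance(a, b):
    """Standard dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cur[j] = min(prev[j] + 1,            # deletion
                         cur[j - 1] + 1,         # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitution
        prev = cur
    return prev[len(b)]

def partition(s, tau):
    """Split s into tau+1 disjoint, roughly equal segments."""
    k = tau + 1
    base, rem = divmod(len(s), k)
    segs, pos = [], 0
    for i in range(k):
        length = base + (1 if i < rem else 0)
        segs.append(s[pos:pos + length])
        pos += length
    return segs

class IncJoin:
    """Incremental similarity join: the inverted index is the saved state."""
    def __init__(self, tau):
        self.tau = tau
        self.index = defaultdict(list)  # segment -> strings containing it

    def insert(self, s):
        """Join the new string s against the state, then update the state."""
        matches, seen = [], set()
        # Probe: any indexed segment occurring inside s yields a candidate.
        for seg, strings in self.index.items():
            if seg and seg in s:
                for t in strings:
                    if t not in seen:
                        seen.add(t)
                        if edit_distance(s, t) <= self.tau:
                            matches.append(t)
        # State update: index the segments of s for future arrivals.
        for seg in partition(s, self.tau):
            self.index[seg].append(s)
        return matches
```

Because each arriving string is joined only against the index built so far, each pair is compared once, which is the source of the incremental method's savings over re-running a batch join on every update.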
Optimization of small files storage and accessing on Hadoop distributed file system
LI Tie, YAN Cairong, HUANG Yongfeng, SONG Yalong
Journal of Computer Applications    2014, 34 (11): 3091-3095.   DOI: 10.11772/j.issn.1001-9081.2014.11.3091
In order to improve the efficiency of processing small files in the Hadoop Distributed File System (HDFS), a new approach named SmartFS was proposed. By analyzing the file access log to capture users' access behavior, SmartFS established a probability model of file associations. Guided by this model, a merging algorithm combined related small files into large files, which were then stored on HDFS. When a file was accessed, SmartFS prefetched its related files according to a prefetching algorithm to accelerate access, and a cache replacement algorithm was put forward to guarantee sufficient cache space. The experimental results show that SmartFS can save metadata space on the NameNode of HDFS, reduce the interaction between users and HDFS, and improve the efficiency of storing and accessing small files on HDFS.
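The access-log model, prefetching, and cache replacement described above can be sketched as follows. This is an assumption-laden illustration, not the SmartFS implementation: file association is approximated by the conditional probability of consecutive accesses in the log, prefetching pulls in files whose probability passes a threshold, and replacement is plain LRU. All names (`SmartFSCache`, `threshold`, the `fetch` callback) are hypothetical.

```python
from collections import defaultdict, OrderedDict

class SmartFSCache:
    """Toy cache that learns file associations from the access log,
    prefetches associated files on a miss, and evicts by LRU."""

    def __init__(self, capacity, threshold=0.5):
        self.capacity = capacity
        self.threshold = threshold
        self.cooccur = defaultdict(lambda: defaultdict(int))  # f -> g -> count
        self.count = defaultdict(int)                         # accesses of f
        self.cache = OrderedDict()                            # LRU order
        self.prev = None

    def record_access(self, f):
        """Update the association model from consecutive log entries."""
        self.count[f] += 1
        if self.prev is not None and self.prev != f:
            self.cooccur[self.prev][f] += 1
        self.prev = f

    def related(self, f):
        """Files g whose estimated P(g follows f) passes the threshold."""
        total = self.count[f] or 1
        return [g for g, c in self.cooccur[f].items() if c / total >= self.threshold]

    def access(self, f, fetch):
        """Serve f from cache if possible; on a miss, fetch and prefetch."""
        self.record_access(f)
        if f in self.cache:
            self.cache.move_to_end(f)     # refresh LRU position
            return self.cache[f]
        data = fetch(f)
        self._put(f, data)
        for g in self.related(f):         # prefetch associated files
            if g not in self.cache:
                self._put(g, fetch(g))
        return data

    def _put(self, f, data):
        self.cache[f] = data
        self.cache.move_to_end(f)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
```

The same association model could drive the merging step: files whose pairwise probabilities are high would be packed into one large HDFS file, so that a single NameNode entry and a single block read cover a whole group of small files.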